The Adversarial GLUE Benchmark
Performance of FreeLB (single model) on AdvGLUE
Overall Statistics
Accuracy (%) of FreeLB on the GLUE dev set and on AdvGLUE, by task and attack level (0 to 100 scale). QQP is also reported in F1.

Task     Metric    GLUE Dev  AdvGLUE Word  AdvGLUE Sentence  AdvGLUE Human  AdvGLUE Overall
SST-2    Accuracy  96.4      66.5          49.0              82.3           61.7
QQP      Accuracy  92.6      33.1          40.5              87.7           42.2
QQP      F1        92.6      27.2          -                 78.9           31.1
QNLI     Accuracy  95.0      70.6          57.7              47.4           62.3
RTE      Accuracy  86.7      66.7          56.6              -              62.2
MNLI-m   Accuracy  90.6      38.5          21.1              -              31.6
MNLI-mm  Accuracy  90.6      35.7          18.1              26.4           27.6

A dash means no value is shown for that entry.
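One way to read the overall statistics is as a per-task robustness gap: GLUE dev accuracy minus overall AdvGLUE accuracy. The short Python sketch below computes that gap. It assumes each value is paired with its task in the order the bar labels appear in the chart; the names `SCORES` and `accuracy_drop` are illustrative and not part of the benchmark.

```python
# Accuracy (%) of FreeLB per task as (GLUE dev, AdvGLUE overall).
# Assumption: values are paired with tasks in chart order.
SCORES = {
    "SST-2": (96.4, 61.7),
    "QQP": (92.6, 42.2),
    "QNLI": (95.0, 62.3),
    "RTE": (86.7, 62.2),
    "MNLI-m": (90.6, 31.6),
    "MNLI-mm": (90.6, 27.6),
}


def accuracy_drop(scores):
    """Return the drop in accuracy points from GLUE dev to AdvGLUE, per task."""
    return {task: round(dev - adv, 1) for task, (dev, adv) in scores.items()}


if __name__ == "__main__":
    # Print tasks from largest to smallest drop.
    drops = accuracy_drop(SCORES)
    for task, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
        print(f"{task:8s} {drop:5.1f}")
```

Under that pairing, the largest gap is on MNLI-mm (63.0 points) and the smallest on RTE (24.5 points).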
Performance of FreeLB (single model) on each task
The Stanford Sentiment Treebank (SST-2)
Adversarial accuracy (%) by attack type, on a 0 to 100 scale:
Word: Typo 53.1, Knowledge 79.1, Embedding 67.6, Context 69.1, Composition 61.4
Sentence: Syntactic 39.4, Distraction 63.5
Human: CheckList 82.3
Quora Question Pairs (QQP)
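Adversarial accuracy and F1 (%) by attack type, on a 0 to 100 scale:
Word (accuracy): Typo 36.0, Knowledge 5.9, Embedding 19.0, Context 28.6, Composition 39.7
Sentence: Syntactic 40.5
Human: CheckList 87.7 (accuracy), 78.9 (F1)
Further bar labels, in chart order and not attributable to a specific bar: 27.3, 11.1, 15.0, 13.8, 35.5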
MultiNLI (MNLI) matched
Adversarial accuracy (%) by attack type, on a 0 to 100 scale:
Word: Typo 37.0, Knowledge 37.5, Embedding 26.2, Context 44.7, Composition 38.6
Sentence: Syntactic 18.4, Distraction 25.9
MultiNLI (MNLI) mismatched
Adversarial accuracy (%) by attack type, on a 0 to 100 scale:
Word: Typo 45.9, Knowledge 34.5, Embedding 29.6, Context 27.6, Composition 35.4
Sentence: Syntactic 14.2, Distraction 24.8
Human: StressTest 29.0, ANLI 23.7
Question NLI (QNLI)
Adversarial accuracy (%) by attack type, on a 0 to 100 scale:
Word: Typo 69.4, Knowledge 73.2, Embedding 65.8, Context 66.3, Composition 75.5
Sentence: Syntactic 47.5, Distraction 71.7
Human: CheckList 64.9, AdvSQuAD 35.0
Recognizing Textual Entailment (RTE)
Adversarial accuracy (%) by attack type, on a 0 to 100 scale:
Word: Typo 65.2, Knowledge 67.7, Embedding 81.4, Context 63.0, Composition 54.5
Sentence: Syntactic 47.7, Distraction 72.9
AdvGLUE
UIUC Secure Learning Lab
Microsoft Research